perf: move EG() and CG() in ZTS builds into __thread storage by henderkes · Pull Request #22231 · php/php-src

henderkes · 2026-06-05T13:00:29Z

So, I was chasing what caused the much higher instruction count on aarch64 for clang vs gcc and while figuring that out, got side tracked to figuring out that ZTS builds also have a ton of extra instructions compared to NTS builds. Turned out it was mostly just from constantly accessing executor and compiler globals when running phpstan. Each access became pseudo assembly:

mrs _tsrm_ls_cache
add cache offset low bits
add cache offset high bits
load relative base address
load accessed global

Here we're slashing one extra load and one add by moving EG and CG onto actual thread storage. Benchmarks are using phpstan on full laravel codebase (full symfony codebase showed the same, but I added the LE optimization after and don't want to rerun 6 hours of benchmarks) and phoronix-test-suite phpbench.

aarch64:

~80% fewer extra instructions
Ampere A1 cloud machine, ~0.15% variance in wall clock between runs

build	PHPStan cyc	instr	wall	phpbench	phpbench JIT
gcc-nts	90.67B	108.22B	32.0s	715719	1164671
gcc-zts	94.44B	116.28B	33.2s	662475	1111162
gcc-zts-patch	91.61B	109.66B	32.3s	704490	1146721
clang-nts	95.20B	113.29B	33.4s	709077	1118672
clang-zts	97.92B	120.82B	34.4s	665074	1069905
clang-zts-patch	93.40B	114.72B	33.1s	696929	1101997

x86_64:

~35% fewer extra instructions
no cycle count on x86_64 because lack of hardware counter support in WSL
i7 12650h with locked cpu frequency, take the wall times with a grain of salt:

build	PHPStan cyc	instr	wall	phpbench	phpbench JIT
gcc-nts	n/a	11.01B	113.91s	1065687	1793058
gcc-zts	n/a	11.23B	115.47s	1033552	1767360
gcc-zts-patch	n/a	11.15B	116.34s	1005077	1713823
clang-nts	n/a	11.38B	112.92s	1125988	1795096
clang-zts	n/a	11.64B	114.67s	1087697	1757513
clang-zts-patch	n/a	11.53B	112.22s	1145744	1793294

LLM disclosure

claude

adjusted the JIT
transformed the raw bash results into these (hopefully beautiful) html tables
edit: it failed at the tables, or I failed copy-pasting

henderkes · 2026-06-05T15:30:11Z

Meant to write comments explaining a few peculiar decisions... but got carried away in the evening. Will get to it and tag relevant reviewers in the morning.

henderkes · 2026-06-06T02:44:57Z


 static void ts_free_resources(tsrm_tls_entry *thread_resources)
 {
+	bool own_thread = thread_resources->thread_id == tsrm_thread_id();


Here it's possible that a thread id was recycled, but the tls points to a now obsolete one

henderkes · 2026-06-06T02:45:41Z

 			}

-			if (!resource_types_table[i].fast_offset) {
+			if (!resource_types_table[i].fast_offset && !resource_types_table[i].tls_addr) {


we can't manually free __thread storage

henderkes · 2026-06-06T02:47:45Z

+/* allocates a resource id whose per-thread storage is a native __thread block */
+TSRM_API ts_rsrc_id ts_allocate_tls_id(ts_rsrc_id *rsrc_id, void *(*tls_addr)(void), size_t size, ts_allocate_ctor ctor, ts_allocate_dtor dtor)
+{/*{{{*/
+	TSRM_ERROR((TSRM_ERROR_LEVEL_CORE, "Obtaining a new TLS resource id, %d bytes", size));


function largely copied from above, looking at it now I see that size_t should be printed as %zu.

henderkes · 2026-06-06T02:53:41Z

 # define TSRM_TLS_MODEL_ATTR
 # define TSRM_TLS_MODEL_DEFAULT
-#elif __PIC__
+#elif __PIC__ && !defined(__PIE__)


a PIE program can use local exec if it's the main executable. Only shared libraries (embed, extensions) need to fall back to initial-exed.

This alone would already be a small speedup (one fewer instruction per access)

henderkes · 2026-06-06T03:08:47Z

+    AS_VAR_APPEND([CFLAGS], [" -DZEND_EG_TLS"])
+
+    dnl -mtls-size=12 drops the dead high-bits offset add from TLS access,
+    dnl valid while the thread-local block stays under 4 KiB.


This would produce linker errors if tls size exceeded 4kb, but I did a test atatic build with 100 extensions statically compiled in (what a terrible idea) and tls size ended up at 3.7kb. I can't think of a way to test whether a link would succeed before compiling, so this is unconditional.

henderkes · 2026-06-06T03:23:30Z

@arnaud-lb I also had a different idea on achieving the same thing by pinning resolved EG and CG to a register on gcc. It worked well, but

I wouldn't know how to guarantee the extra register doesn't regress performance in some scenarios
libxml & other extensions clobbered any register I picked, so it took a large changelist
is hybrid-vm only, for Clang I tried passing them as tailcall params, but it became very messy

Perhaps you have a different idea here?

arnaud-lb

This is a good idea, but you will need to check that it works in various configurations:

Accessing TLS variables from other modules (shared extensions) used to be broken on some platforms (module wouldn't link or wouldn't load). I don't know if that's still the case, but that's one reason why extensions have their own _tsrm_ls_cache instead of accessing the main one.
local-exec / initial-exec TLS when php is itself a shared library (apache module, embed) is allocated from the TLS surplus which is non portable and limited in size. The size of EG+CG might be larger than surplus. I don't remember why we specify the model explicitly but the loader patches the code to use more efficient models when applicable, so this may not be necessary. (If removing TSRM_TLS_MODEL_ATTR breaks JIT I can help fixing it.)

Register pinning on the hybrid VM would work, but it may increase register pressure. I've tried adding new VM arguments recently for another reason, and it resulted in a performance regression. Global registers would work too, as long as it's a callee-saved register (prevents clobbering by 3rd party libs) (but this increases register pressure everywhere the variable is in scope).

Some idea: EG()/CG() and a few others use ZEND_TSRMG_FAST(), which is _tsrm_ls_cache+some_offset. So EG(foo) is load(_tsrm_ls_cache)+load(some_offset)+offsetof(zend_executor_globals,foo). We could try to make the offset a const expr so that it folds to load(_tsrm_ls_cache)+const:

Create a struct that contains all fast TLS entries (see php_reserve_tsrm_memory)
Replace php_reserve_tsrm_memory() by tsrm_reserve(sizeof(the_struct)).
#define EG(v) ZEND_TSRMG_FAST(offsetof(the_struct, eg), zend_executor_globals *, v)

arnaud-lb · 2026-06-06T08:42:05Z

+ZEND_API TSRM_TLS TSRM_TLS_MODEL_ATTR zend_executor_globals executor_globals_tls;
+ZEND_API TSRM_TLS TSRM_TLS_MODEL_ATTR zend_compiler_globals compiler_globals_tls;


Possibly these could be embeded in the main _tsrm_ls_cache (TSRMLS_MAIN_CACHE_DEFINE) :

ZEND_API TSRM_TLS TSRM_TLS_MODEL_ATTR struct { void *cache; zend_executor_globals eg; zend_compiler_globals cg; } _tsrm_ls_cache;

This would simplify JIT changes as we can access eg/cg at an offset from jit->tls.

Also the compiler may generate better code when both EG and CG are used in the same function.

That's a great idea, I'll give it a spin.

Looking at it, it introduces tsrm macro changes in many places that assume _tsrm_ls_cache is a void*. Could deref address of .cache to keep callers as void* consumers without pulling in the whole struct.

It simplifies the jit a lot though and leaves the ability to move more things later, so I'll probably do it later.

I'll need to redo the jit by hand, llm missed too many crucial parts and happily regressed if ie wasn't available (aka macos and Windows).

henderkes · 2026-06-06T10:17:31Z

This is a good idea, but you will need to check that it works in various configurations:

Accessing TLS variables from other modules (shared extensions) used to be broken on some platforms (module wouldn't link or wouldn't load). I don't know if that's still the case, but that's one reason why extensions have their own _tsrm_ls_cache instead of accessing the main one.

I tested all built in extensions as shared and the test suites ran. With the latest change I also enabled this for Windows though, so I'll need to get back to the benching and testing board.

local-exec / initial-exec TLS when php is itself a shared library (apache module, embed) is allocated from the TLS surplus which is non portable and limited in size.

Hmm, I haven't tested the apache module, but our simple embed test and frankenphp work fine.

The size of EG+CG might be larger than surplus. I don't remember why we specify the model explicitly but the loader patches the code to use more efficient models when applicable, so this may not be necessary. (If removing TSRM_TLS_MODEL_ATTR breaks JIT I can help fixing it.)

It's permitted to, but not guaranteed to. Clang is pretty good at it, but in GCC aarch64 instruction count goes up 3% by compiling as PIC (without this patch). As for the surplus, this wouldn't change here, as EG/CG are part of the initial load set. Anything that fit in the surplus before will still fit now. The only exception to this would be something like dlopen called on libphp.so from a program that doesn't link against it... not sure how well that'd work anyhow.

Register pinning on the hybrid VM would work, but it may increase register pressure. I've tried adding new VM arguments recently for another reason, and it resulted in a performance regression. Global registers would work too, as long as it's a callee-saved register (prevents clobbering by 3rd party libs) (but this increases register pressure everywhere the variable is in scope).

Some idea: EG()/CG() and a few others use ZEND_TSRMG_FAST(), which is _tsrm_ls_cache+some_offset. So EG(foo) is load(_tsrm_ls_cache)+load(some_offset)+offsetof(zend_executor_globals,foo). We could try to make the offset a const expr so that it folds to load(_tsrm_ls_cache)+const:

Create a struct that contains all fast TLS entries (see php_reserve_tsrm_memory)

Replace php_reserve_tsrm_memory() by tsrm_reserve(sizeof(the_struct)).

#define EG(v) ZEND_TSRMG_FAST(offsetof(the_struct, eg), zend_executor_globals *, v)

I did this with the gcc register to get rid of the two dependent loads and the offset and it worked, using fewer instructions than NTS, but it relies on global registers. Toyed around ideas for Clang but they weren't satisfactory. We could of course use the approach without global register variables, but then it only gets rid of an independent offset load while keeping the two dependent loads, while this PR's approach already brings it down to a single load.

It would however be worth doing alongside this PR, just for the other tsrm globals, even though they're accessed so rarely I'm not sure it really matters.

arnaud-lb · 2026-06-06T12:29:44Z

The only exception to this would be something like dlopen called on libphp.so from a program that doesn't link against it... not sure how well that'd work anyhow.

This is the case of the apache module

henderkes added 2 commits June 5, 2026 18:42

move EG() and CG() to __thread storage

5801720

fix jit

08c6319

henderkes requested a review from dstogov as a code owner June 5, 2026 13:00

github-actions Bot added Category: Build System Category: Engine Extension: opcache ABI break labels Jun 5, 2026

henderkes commented Jun 6, 2026

View reviewed changes

remove unused old paths

7d846b4

henderkes marked this pull request as draft June 6, 2026 06:56

arnaud-lb reviewed Jun 6, 2026

View reviewed changes

fix windows

a4ec0fc

		ZEND_API TSRM_TLS TSRM_TLS_MODEL_ATTR zend_executor_globals executor_globals_tls;
		ZEND_API TSRM_TLS TSRM_TLS_MODEL_ATTR zend_compiler_globals compiler_globals_tls;

Conversation

henderkes commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

LLM disclosure

Uh oh!

henderkes commented Jun 5, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

henderkes commented Jun 6, 2026

Uh oh!

arnaud-lb left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

henderkes commented Jun 6, 2026

Uh oh!

arnaud-lb commented Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

henderkes commented Jun 5, 2026 •

edited

Loading

arnaud-lb left a comment •

edited

Loading